Introduction

Live worksheet

https://ash-bell.github.io/BIOM549-RNA-Seq-Tutorial/

Welcome to the BIOM549 bioinformatics workshop. The goal of this workshop is to provide you with some basic insight in how RNA-Seq data sets are analysed. Using this worksheet we will cover the following Intended Learning Outcomes (ILOs) for this module.

  1. Interpret experimental data obtained using experimental techniques
  2. With minimal guidance, deploy established techniques of analysis, practical investigation and enquiry within the biosciences
  3. Critically evaluate experimental data

Timetable 2022

There are two sessions for this workshop both taught in person. It is not compulsory to attend but, these are the only times I am teaching. I would highly recommend you attend them.

Date & Time Location Items
Wed 30 Nov 13:30 - 16:30 Old Library 137 Installing R and analysis walkthrough
Fri 16 Dec 10:30 - 13:00 Amory B105 How to present your data

If you have any questions or queries outside of these times, you can email me at and I will get back to you when I can.

Preface

In your first stream you should have either worked with zebrafish, plants or fungi. Each of those organisms would have been exposed to differing environmental conditions, which in turn, may result in them expressing different genes in response. Knowledge of which genes are expressed in different exposure conditions help us understand the host’s response to stimuli. Editing or monitoring these pathways can help increase organism fitness and act as bio-markers of disease. Measuring the difference in gene expression is the goal of RNA sequencing and is what we will be aiming to achieve in this tutorial.

Experimental design

Here you are provided with a data set from the plants stream where Arabidopsis thaliana pif4 mutants are grown at 22\(^\circ\)C and 27\(^\circ\)C with samples collected at 2 and 7 days exposure. Collected samples have their RNA extracted, sequenced and mapped back to their original genome to determine which genes are enriched. From there we get a “count” data set, representing the number of RNA sequences mapping back to each Arabidopsis gene. From here, we can devise several experimental questions.

Experimental questions

  1. How many genes are significantly up-regulated or down-regulated in the pif4 mutant compared to our controls?
  2. What are some examples of gene(s) are significantly up-regulated or down-regulated in the pif4 mutant compared to our controls? What functions are they part of?
  3. Which functions are differentially expressed in our pif4 mutant compared to our controls?

Your task

You are to provide a maximum of three plots that best answer the experimental questions set above for feedback in two weeks time - Tues 13 Dec. See the EXTRA: Better plotting examples for ideas to improve your plots.

During the Tues 13 Dec session I will go through your plots and help troubleshoot any issues. At the end of the second session you should have planned out and started on your poster due at the end of Term 2.

Installation

When you turn up to the sequencing practical, you need to bring your laptop. On your laptop you need to have R and RStudio installed before you can analyse any data. You can find instructions on how to do this below.

  1. Download and install R for Windows/Mac/Linux from the official R website.This it the programming language we will be using to analyse our data. https://cran.rstudio.com/
  2. Install RStudio Desktop (Free version) for Windows/Mac/Linux. This is program that allows us to write in R easily. https://www.rstudio.com/products/rstudio/download/
  3. After installing R and RStudio, open up RStudio. We next need to install R packages - additional programs written in R to analyse our data. To install R packages, copy the following and paste it into the console window. Press enter to run the code in your console.
install.packages(c("tidyverse", "pheatmap", "ggupset", "ggridges", "europepmc"))

4. After tidyverse has been installed (it will take a couple minutes), install the BiocConductor packages the same way. Copy and paste the whole code chunk into the console and press enter. This will take a couple of minutes. After this, you’re done. You never need to re-install any package you have previously installed.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("DESeq2","clusterProfiler","enrichplot"))

Analysis

This portion starts the analysis. You should have access to three files on the SharePoint drive accessed through the BIOM549 VLE page. These are labelled metadata.csv, count_data.csv and Gene_annotations.tsv. The metadata table is a data frame containing which samples belong to which experimental group. The counts data frame is the output from the pre-processing analysis. This contains the number of reads that map to each Arabidopsis gene. The gene annotation file tell us the function of all genes within the Arabidopsis genome. Lets take a look at the data.

Set-up

First we need create a script. This is like making a word document and records all the work you do. To make a script click File –> New File –> R Script. You can write and run anything here. Click on the save button and save this file somewhere sensible on your computer. Now download the metadata and counts file from SharePoint and save it in the same folder as your script. Everything we are going to do from now on will be done in this folder.

Setting a working directory

Next we need to tell the computer where is folder is located and everything we do should be done here. To do that go to Session –> Set Working Directory –> To Source File Location. A command will now run in the console that sets your working directory setwd(). Copy and paste this command into your RScript, so that when you run this script again, it always starts from the same place. Your script should look something like this.

# Copy and paste your working directory here. This is mine and different to yours
setwd("C:/Users/agb214/OneDrive - University of Exeter/PhD/BIOM549/")

Loading in our libraries

If you recall we installed some R libraries previously. Although we never need to re-install our libraries, we need to tell R we want to use them each time we run our script (We never need to re-install Microsoft Word but we do need to tell windows to open it to use it). To do this, we load in our libraries one by one. Its good practice to include all your libraries at the start of your script so anyone else running your script knows what libraries they need to install to used your code. Copy and paste the following into your script. To run from your script, either click on the line you want to run and press Ctrl+Enter. This will only run the line that your cursor is on. To run multiple lines, highlight them all and press Ctrl+Enter. Do this now.

library(DESeq2)
library(tidyverse)
library(pheatmap)
library(clusterProfiler)
library(enrichplot)
library(ggupset)
library(ggridges)
library(europepmc)

Reading in our data sets

Now we can finally load in and look at our data specifying the first column of our data set is our row names. Again copy this into your script and run the code below.

count_data <- read.csv("count_data.csv", row.names = 1)
metadata <- read.csv("metadata.csv", row.names = 1)
head(count_data)
##           Col.2d.22C.rep1 Col.2d.22C.rep2 Col.2d.22C.rep3 pif4.2d.22C.rep1
## AT1G01010            1225            1001            1256             1222
## AT1G01020            2076            1921            2035             2025
## AT1G01030            1115            1086            1102             1077
## AT1G01040            1733            1517            1795             1701
## AT1G01050            7031            7534            7171             7068
## AT1G01060            6946            8697            7634             7283
##           pif4.2d.22C.rep2 pif4.2d.22C.rep3 Col.2d.27C.rep1 Col.2d.27C.rep2
## AT1G01010             1115             1341            1268            1190
## AT1G01020             1958             2075            1999            2248
## AT1G01030             1157             1107             863             815
## AT1G01040             1627             1634            1532            1657
## AT1G01050             6974             7050            7684            7627
## AT1G01060             7420             7216            5496            4936
##           Col.2d.27C.rep3 pif4.2d.27C.rep1 pif4.2d.27C.rep2 pif4.2d.27C.rep3
## AT1G01010            1252             1198             1129             1318
## AT1G01020            2031             1854             2041             2440
## AT1G01030             919              832              803              908
## AT1G01040            1887             1868             1853             2069
## AT1G01050            7739             7086             7884             6818
## AT1G01060            7206             5438             5505             5767
##           Col.7d.22C.rep1 Col.7d.22C.rep2 Col.7d.22C.rep3 pif4.7d.22C.rep1
## AT1G01010            1919            1908            2077             2263
## AT1G01020            2264            1898            1764             2214
## AT1G01030             806             672             641             1076
## AT1G01040            1873            1982            1968             2224
## AT1G01050            8393            9038            8774             7242
## AT1G01060           11844           13650           12584            10324
##           pif4.7d.22C.rep2 pif4.7d.22C.rep3 Col.7d.27C.rep1 Col.7d.27C.rep2
## AT1G01010             2035             2012            2218            2387
## AT1G01020             2396             1695            2715            2863
## AT1G01030             1084             1054             638             800
## AT1G01040             2144             1884            1618            2177
## AT1G01050             7529             7433            8400            8003
## AT1G01060            10645            13128            9450            7605
##           Col.7d.27C.rep3 pif4.7d.27C.rep1 pif4.7d.27C.rep2 pif4.7d.27C.rep3
## AT1G01010            2331             2311             2145             2202
## AT1G01020            2835             2949             2713             2924
## AT1G01030             761             1426             1092             1236
## AT1G01040            2032             2264             2080             2170
## AT1G01050            7651             7296             7932             7423
## AT1G01060            7344             8177             8648             7675
head(metadata)
##                  Temperature Experimental_group Replicate Days_exposure
## Col.2d.22C.rep1           22           wildtype         1             2
## Col.2d.22C.rep2           22           wildtype         2             2
## Col.2d.22C.rep3           22           wildtype         3             2
## pif4.2d.22C.rep1          22        pif4_mutant         1             2
## pif4.2d.22C.rep2          22        pif4_mutant         2             2
## pif4.2d.22C.rep3          22        pif4_mutant         3             2

The command head() allows us to look at the first 5 rows of our data. If we want to take a closer look we can instead use the View() command which will open the data set in a new tab. Beware using View() on very large data sets, you’ll freeze your computer! For this exercise, you should be fine.

View(count_data)
View(metadata)

As you can see, our count data set is the number of reads that mapped to each of the Arabidopsis genes for each of our samples. The metadata data set tells us which sample belongs to which exposure group. Pretty simple. Now lets see if any of the genes are significantly enriched (or not!) in our data set compare to the controls.

Analysis

Combining data sets into a DESeq2 object

We need to combine our count data with our metadata. We use a package called DESeq2 which helps us keep track of multiple data sets in one experiment. Before we do that, we need to tell DESeq2 which is our control and experimental groups. To do this we set the column “Experimental group” as a factor, meaning they are categories and the first level (category) is “wildtype” or our control. Now we’re good to go.

metadata$Experimental_group <- as.factor(metadata$Experimental_group)
metadata$Experimental_group <- relevel(metadata$Experimental_group, ref = "wildtype")
metadata$Temperature <- as.factor(metadata$Temperature)
metadata$Replicate <- as.factor(metadata$Replicate)

Multi-factors

We also need to account that our experiment has multiple parameters, not just two different experimental groups (pif4 and wildtype). DESeq2 can account for this if you tell it. Here we define all our groups, Temperature, Experimental group (wildtype or mutant) and Days exposure (number of days exposed to each temperature).Here we also include an interaction term: Days_exposure:Temperature which tells us if the number of days exposed to each temperature together have an compounded effect compared to if they were individually measured.

ddsMF <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = metadata,
                              design = ~Experimental_group)
design(ddsMF) <- formula(~ Temperature + Experimental_group + Days_exposure, Days_exposure:Temperature)
ddsMF
## class: DESeqDataSet 
## dim: 37336 24 
## metadata(1): version
## assays(1): counts
## rownames(37336): AT1G01010 AT1G01020 ... ATMG09960 ATMG09980
## rowData names(0):
## colnames(24): Col.2d.22C.rep1 Col.2d.22C.rep2 ... pif4.7d.27C.rep2
##   pif4.7d.27C.rep3
## colData names(4): Temperature Experimental_group Replicate
##   Days_exposure

Notice when we type in ddsMF into our console, we now have a DESeqDataSet. It has told us we have a metadata table and count assay, with our row names as genes, columns as samples and metadata categories as Temperature, Experimental_group, Replicate and Days_exposure. Great!

Pre-filtering

We should remove genes that aren’t expressed frequently. This removes the “noise” which helps us find significant trends. These genes that are rarely expressed need to be in higher quantities so we can be sure they are significantly expressed, so we remove them. Here we remove any gene count that isn’t at least 1000 in X samples.

# Get the length of the number of columns in our "count_data" data set
X <- length(colnames(count_data))
# Subset our DESeq2 object where the minimum gene count across our entire data set is over 1000 per gene
keep <- rowSums(counts(ddsMF) >= 1000) >= X
ddsMF <- ddsMF[keep, ]

The great thing about DDSeq2 is it does all our calculations for us. There are genes in our data set. Imagine manually filtering each one out! This is the power of using a coding language to analyse large data sets.

Differential expression analysis

Now we can calculate if the genes are statistically differential expressed. We use the p.adjusted value to account for multiple retesting - the idea that if you test for everything, by random chance you will find something that is significant but its just a “lucky” pattern. If you’re interested, you can find out more here.

ddsMF <- DESeq(ddsMF)

# Define our p-value (alpha) cutoff as 0.05
resMF <- results(ddsMF, alpha = 0.05)

# Give us a quick summary of the results
# LFC > 0 (up) is number of up regulated genes
# LFC < 0 (down) is the number of down regulated genes
summary(resMF)
## 
## out of 11593 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 4457, 38%
## LFC < 0 (down)     : 4171, 36%
## outliers [1]       : 0, 0%
## low counts [2]     : 0, 0%
## (mean count < 1195)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Over half of the genes are up-/down-regulated. However, its expected that at different temperatures and times gene expression will be different. If we just compare the experimental groups

resMF <- results(ddsMF, contrast = c("Experimental_group", "wildtype", "pif4_mutant"), alpha = 0.05)
## [1] "Number of significantly up/down regulated genes: 680"
## [1] "Total number of genes in this dataset: 11593"

Now accounting for differences in temperature, repeats and days exposure, we have qa better picture and differences between the mutant and control.

Plotting

So now we know there are a total number of genes that are significantly up/down regulated. But how does this look compared to the entire data set? We can plot this as a volcano plot.

plotMA(resMF, xlim=c(1000,20000), ylim=c(-0.5,0.5))

The dots in blue are significantly up/down regulated and the ones in grey not. Lets take a closer look at the most significantly differentially regulated gene.

plotCounts(ddsMF, gene=which.min(resMF$padj), intgroup="Experimental_group")

We can definitely see this gene is down-regulated in the pif4 mutant.

Multigene analysis

Its great that we can see genes that are differentially regulated, but what if we want to see more then one? We could plot multiple dot/boxplot like above, but that gets more difficult to visualise the more genes you have. An alternative is a heat map which, plots multiple genes at once.

resSig <- resMF[resMF$padj < 0.05, ]
norm_counts <- counts(ddsMF, normalized = T) %>%
              data.frame()
filtered <- ddsMF[rownames(ddsMF) %in% rownames(resSig), ]
pheatmap(assay(filtered), show_rownames = F)

We can definitely see differences in multiple genes now. Lets pick the top 10 most signficantly differentially expressed genes.

resSig <- resMF[resMF$padj < 0.05, ]
resSig <- resSig[order(resSig$padj), ][1:10, ]
filtered <- ddsMF[rownames(ddsMF) %in% rownames(resSig), ]
pheatmap(assay(filtered))

This helps us identify by eye which genes are worth a closer look at. There is a clear division between pif4 mutant and control experimental groups.

Functional analysis

Its great that we now know which genes are differentially expressed, but what do these genes do? We therefore need to determine the function of each gene in the Arabidopsis genome and we can then speculate if gene X is enriched in a certain treatment, it might result in Y. This is partly why we work on a model organsism such as Arabidopsis as there are hundreds if not thousands of studies that have already characterised the function of genes for us. Theses are then classified as genome ontology terms (GO terms) which give a curated name and function to each gene.This is because many studies may refer to a gene differently, so we standardised the names and functions in a database. I have already downloaded the gene name to Go term database for you, and you can find more about that here. Lets load in the database and take a look

annotations <- read.csv("Gene_annotations.tsv", sep = "\t")
head(annotations)
##       Locus TAIR.internal.locus.id Gene.Model.s.
## 1 AT1G01010                2200935     AT1G01010
## 2 AT1G01010                2200935     AT1G01010
## 3 AT1G01010                2200935     AT1G01010
## 4 AT1G01010                2200935     AT1G01010
## 5 AT1G01010                2200935     AT1G01010
## 6 AT1G01010                2200935     AT1G01010
##                                           GO.term      GO.ID
## 1                       response to abscisic acid GO:0009737
## 2                   response to water deprivation GO:0009414
## 3 cellular response to oxygen-containing compound GO:1901701
## 4       regulation of DNA-templated transcription GO:0006355
## 5     transcription cis-regulatory region binding GO:0000976
## 6       regulation of DNA-templated transcription GO:0006355
##   TAIR.internal.GO.id category
## 1               11395     proc
## 2                5647     proc
## 3               44644     proc
## 4                7461     proc
## 5               35767     func
## 6                7461     proc
##                                                                                                                       GO.Slim.s.
## 1                                                                         response to chemical | response to endogenous stimulus
## 2                                                       response to abiotic stimulus | response to stress | response to chemical
## 3                                                                                response to chemical | other cellular processes
## 4 nucleobase-containing compound metabolic process | other cellular processes | other metabolic processes | biosynthetic process
## 5                                                                                             nucleic acid binding | DNA binding
## 6 nucleobase-containing compound metabolic process | other metabolic processes | other cellular processes | biosynthetic process
##   Evidence.code                             Reference Made.by
## 1           IEA Publication:501796011|PMID:34562334          
## 2           IEA Publication:501796011|PMID:34562334          
## 3           IEA Publication:501796011|PMID:34562334          
## 4           IEA           AnalysisReference:501756966        
## 5           IPI Publication:501786139|PMID:30356219          
## 6           ISS   Publication:1345963|PMID:11118137          
##    date.last.modified
## 1 10/15/2021 00:00:00
## 2 10/15/2021 00:00:00
## 3 10/15/2021 00:00:00
## 4 09/19/2022 00:00:00
## 5 12/18/2020 00:00:00
## 6 04/01/2021 00:00:00

We can see from this data set we have the gene names (Locus), Go.term (specific function), GO.ID (GO term), category (is this gene function : proc/bp/biological process, func/fc/molecular function or comp/cc/cellular process), Go.Slim.s (general function) and various other details on how the gene function was determined.

Gene Set Enrichment

We want to know if our function (gene set) is differentially expressed in our exposure group compared to our control. However, we have differing numbers of genes that are part of the same function (for example, we may have 10 genes that regulate height, but 100 genes controlling growth rate). One gene that that is part of a small gene set is more influential and changing the overall function of an organism then one gene from a large gene set. Therefore, we need information into the gene set size when determining if a function is differentially expressed. This is calculated as a geneRatio. This is a complicated equation and you can find out more here, but basically it is the fraction of differentially expressed genes found in the gene set. To do that, we use the GSEA function from the library clusterProfiler and plot the results using enrichplot.

The GSEA function requires three inputs, a list of genes and the log2fold change we get from DESeq2’s results function from our multifactor analysis, a term to gene data frame containing our Go terms and gene names, and lastly a term to names data frame with our Go terms to specific Go term functions. We also need to correct for multiple retesting and we use the Benjamini-Hochberg method. We also use an eps of 0 to allow us to test for p-values smaller then the default 1e^-06.

FC <- resMF %>%
  data.frame() %>%
  select(log2FoldChange) %>%
  rownames_to_column(var = "Locus")

geneList <- FC$log2FoldChange
names(geneList) <- as.character(FC$Locus)
geneList = sort(geneList, decreasing = TRUE)

T2G <- annotations %>%
  select(GO.ID, Locus) %>%
  unique()

T2N <- annotations %>%
  select(GO.ID, GO.term) %>%
  unique()

GSE <- GSEA(geneList = geneList, TERM2GENE = T2G, TERM2NAME = T2N, pAdjustMethod = "BH", pvalueCutoff = 0.05, seed = 1234, eps = 0)
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
GSE
## #
## # Gene Set Enrichment Analysis
## #
## #...@organism     UNKNOWN 
## #...@setType      UNKNOWN 
## #...@geneList     Named num [1:11593] 1.046 0.939 0.86 0.848 0.748 ...
##  - attr(*, "names")= chr [1:11593] "AT5G07010" "AT5G35940" "AT5G22500" "AT5G02540" ...
## #...nPerm     
## #...pvalues adjusted by 'BH' with cutoff <0.05 
## #...48 enriched terms found
## 'data.frame':    48 obs. of  11 variables:
##  $ ID             : chr  "GO:0000325" "GO:0005829" "GO:0022626" "GO:0005773" ...
##  $ Description    : chr  "plant-type vacuole" "cytosol" "cytosolic ribosome" "vacuole" ...
##  $ setSize        : int  198 402 51 105 211 58 488 102 62 38 ...
##  $ enrichmentScore: num  0.484 0.385 0.709 0.536 0.413 ...
##  $ NES            : num  2.36 2.04 2.73 2.39 2.02 ...
##  $ pvalue         : num  1.58e-12 7.52e-12 2.65e-11 3.24e-09 2.11e-08 ...
##  $ p.adjust       : num  6.46e-10 1.53e-09 3.61e-09 3.31e-07 1.73e-06 ...
##  $ qvalue         : num  5.07e-10 1.20e-09 2.83e-09 2.59e-07 1.35e-06 ...
##  $ rank           : num  3230 2835 2595 2726 2696 ...
##  $ leading_edge   : chr  "tags=57%, list=28%, signal=42%" "tags=43%, list=24%, signal=33%" "tags=78%, list=22%, signal=61%" "tags=53%, list=24%, signal=41%" ...
##  $ core_enrichment: chr  "AT1G22740/AT2G27730/AT1G19910/AT1G11260/AT1G20010/AT2G28900/AT1G04820/AT1G58030/AT1G78380/AT1G47128/AT1G04410/A"| __truncated__ "AT1G02340/AT1G12080/AT1G64660/AT2G23600/AT2G03680/AT1G62480/AT1G06850/AT2G29960/AT1G19350/AT1G20440/AT1G70830/A"| __truncated__ "AT1G01120/AT1G67090/AT1G03090/AT2G05840/AT1G20620/AT1G66200/AT2G27720/AT1G43170/AT1G07610/AT1G02780/AT1G67430/A"| __truncated__ "AT2G24280/AT2G23600/AT1G19910/AT2G24270/AT2G22170/AT1G75780/AT1G47128/AT1G11910/AT1G72150/AT2G29550/AT2G25450/A"| __truncated__ ...
## #...Citation
##  T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
##  clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
##  The Innovation. 2021, 2(3):100141

Lets take a look at some of the plots.

upsetplot(GSE)

This is an upset plot showing you in each of the Go Terms, if it has multiple functions (involved in more that one pathway), what its gene ratio is.

dotplot(GSE, showCategory=10, split=".sign") + 
  facet_grid(.~.sign)

For the top ten gene sets which functions are enriched or suppressed, how many Go terms are contribution to that gene set and is the statistically significant.

ridgeplot(GSE, showCategory=10) + 
  labs(x = "Gene Ratio Distribution")
## Picking joint bandwidth of 0.0137

For the top ten gene sets, show the Gene Ratio of each Go term and plot them as distributions.

emapplot(pairwise_termsim(GSE), showCategory = 100)

Of the top 100 gene sets, which ones are related by function and are they statistically significant.

cnetplot(GSE, categorySize="pvalue", foldChange=geneList, showCategory = 1)
## Warning in cnetplot.enrichResult(x, ...): Use 'color.params = list(foldChange = your_value)' instead of 'foldChange'.
##  The foldChange parameter will be removed in the next version.
## Scale for size is already present.
## Adding another scale for size, which will replace the existing scale.

Which genes are contributing to the top gene set and what is the log2fold change for each gene

pmcplot(GSE$Description[1:5], 2010:2018, proportion=FALSE)

Of the top 5 gene sets, between 2010 and 2018, how many studies on PMC mentioned this gene set by name.

EXTRA: Better plotting examples

Volcano plot of significantly enriched genes

library(ggrepel)
vol_plot_data <- resMF %>% 
                  data.frame() %>%
                  mutate(sig = padj < 0.05 & (log2FoldChange > 0.5 | log2FoldChange < -0.5),
                  label = ifelse(sig == "TRUE", row.names(.), ""))


ggplot(vol_plot_data, aes(x = log2FoldChange, y = -log10(padj))) +
  geom_point(aes(colour = sig)) +
  xlab("log2 fold change") + 
  ylab("-log10 adjusted p-value") +
  theme_bw() +
  theme(legend.position = "none") +
  geom_text_repel(aes(label = label), box.padding = 0.5) +
  geom_vline(xintercept = c(-0.5, 0.5), linetype="dotted") +
  geom_hline(yintercept = -log10(0.05), linetype="dotted")

## Boxplot gene of interest

#Select your chosen gene 
df <- plotCounts(ddsMF, gene=which.min(resMF$padj), 
                intgroup=c("Experimental_group", "Temperature"), 
                returnData = T) %>%
                mutate(Temperature = as.factor(Temperature))
df$Temperature <- relevel(df$Temperature, ref = "27")

ggplot(df, aes(x = Experimental_group, y = count, fill = Temperature)) + 
  geom_boxplot() +
  geom_dotplot(binaxis = 'y', stackdir='center', dotsize = 0.75, 
  position=position_dodge(0.75))
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

dotplot(GSE, showCategory=10, split=".sign") + 
  facet_grid(.~.sign) + 
  scale_color_gradient(high = "#56B1F7", low = "#132B43")
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

ridgeplot(GSE, showCategory=10) + 
  labs(x = "Gene Ratio Distribution",
       title = "Top 10 enriched functions") + 
  scale_fill_gradient(high = "#56B1F7", low = "#132B43")
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## Picking joint bandwidth of 0.0137

resSig <- resMF[resMF$padj < 0.05, ]
resSig <- resSig[order(resSig$padj), ][1:10, ]
filtered <- ddsMF[rownames(ddsMF) %in% rownames(resSig), ]
pheatmap(t(assay(filtered)), 
         color=colorRampPalette(c("blue1", "skyblue", "white", "pink", "red"))(50), 
         display_numbers = T, number_format = "%.0f",
         annotation_row = metadata)